This script takes the Round 1 IDG-DREAM results and performs the following steps:

- Basic submission stats
- Rescoring the results using a reticulate connection to the scoring harness Python script
- Plotting the rescored results to confirm they are in line with the final leaderboard
- Bootstrap evaluation of the predictions (resampling with replacement) to assess the stability of results for all six metrics
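For reference, the six metrics can be sketched in Python as below. This is an illustrative version, not the actual scoring harness code; the binarization threshold for F1 and AUC, and the use of a single-threshold AUC in place of `average_AUC`, are assumptions.

```python
# Illustrative sketch of the six metrics; NOT the actual scoring harness.
import numpy as np
from scipy import stats
from sklearn.metrics import f1_score, roc_auc_score, mean_squared_error

def ci(y_true, y_pred):
    """Concordance index: fraction of concordant pairs among all pairs
    with distinct true values; tied predictions count as 0.5."""
    pairs = 0
    concordant = 0.0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] != y_true[j]:
                pairs += 1
                diff = (y_pred[i] - y_pred[j]) * (y_true[i] - y_true[j])
                if diff > 0:
                    concordant += 1
                elif diff == 0:
                    concordant += 0.5
    return concordant / pairs

def score(y_true, y_pred, threshold=7.0):
    """Compute all six metrics; `threshold` for binarization is an assumption."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    binary_true = (y_true >= threshold).astype(int)
    return {
        "pearson": stats.pearsonr(y_true, y_pred)[0],
        "spearman": stats.spearmanr(y_true, y_pred)[0],
        "rmse": np.sqrt(mean_squared_error(y_true, y_pred)),
        "ci": ci(y_true, y_pred),
        "f1": f1_score(binary_true, y_pred >= threshold),
        # single-threshold AUC standing in for average_AUC
        "average_AUC": roc_auc_score(binary_true, y_pred),
    }
```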

Basic Submission Stats

Round 1 had 170 total submissions from 78 different Synapse users and 30 teams.

Number of submissions in Round 1, summarized and broken down by submitter:

table(leaderboard$userName) %>% as.data.frame() %>% arrange(-Freq) %>% count(Freq)
## # A tibble: 3 x 2
##    Freq     n
##   <int> <int>
## 1     1    20
## 2     2    24
## 3     3    34
library(DT)
DT::datatable(table(leaderboard$userName) %>% as.data.frame() %>% arrange(-Freq))

### Histograms of scores

ggplot(leaderboard)+
  geom_histogram(aes(x=pearson), binwidth = 0.01)

ggplot(leaderboard)+
  geom_histogram(aes(x=spearman), binwidth = 0.01)
## Warning: Removed 3 rows containing non-finite values (stat_bin).

ggplot(leaderboard)+
  geom_histogram(aes(x=log(rmse)), binwidth = 0.01)

ggplot(leaderboard)+
  geom_histogram(aes(x=ci), binwidth = 0.01)

ggplot(leaderboard)+
  geom_histogram(aes(x=f1), binwidth = 0.01)

ggplot(leaderboard)+
  geom_histogram(aes(x=average_AUC), binwidth = 0.01)

Local rescoring validation

All six metrics rescored by this script match those from the final leaderboard. A couple of scores differ very slightly, likely because this script uses different library versions than the scoring harness; I do not consider this an important difference for this preliminary analysis.
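A minimal sketch of how "matching within a small tolerance" can be checked; the function name, column layout, and tolerance here are illustrative, not part of the actual script.

```python
# Hypothetical tolerance check between rescored and leaderboard values.
import numpy as np

def matches_leaderboard(rescored, leaderboard, tol=1e-4):
    """Boolean mask of scores agreeing within an absolute tolerance."""
    return np.isclose(np.asarray(rescored), np.asarray(leaderboard), atol=tol)

# A difference in the 6th decimal place still counts as a match:
mask = matches_leaderboard([0.712345, 0.540000], [0.712346, 0.540000])
```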

Bootstrapping

To bootstrap a given prediction file, this script:

- randomly samples 430 predictions, with replacement, from the prediction file;
- computes the six metrics for those resampled predictions;
- repeats this 20 times per prediction file to generate a distribution of bootstrapped scores.

I then plotted the top 20 predictions for each metric, using the leaderboard value, superimposed on the distribution of their bootstrapped scores. Bars are ranked from best to worst performer (based on the single leaderboard value). Diamonds mark the actual leaderboard value.
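The resampling loop described above can be sketched as follows; this is a Python illustration of the logic (the actual script is R), with Spearman's rho standing in for the full set of six metrics.

```python
# Sketch of the bootstrap: resample 430 predictions with replacement,
# score each resample, repeat 20 times per prediction file.
import numpy as np
from scipy import stats

def bootstrap_scores(y_true, y_pred, n_samples=430, n_boot=20, seed=0):
    """Return n_boot bootstrapped Spearman scores for one prediction file."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    scores = []
    for _ in range(n_boot):
        # sample indices with replacement
        idx = rng.integers(0, len(y_true), size=n_samples)
        scores.append(stats.spearmanr(y_true[idx], y_pred[idx])[0])
    return np.array(scores)
```

Plotting the resulting distribution next to the single leaderboard value shows how stable each submission's rank is under resampling.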

Spearman Correlation

Pearson Correlation

RMSE

CI

F1

Average AUC

Spearman Correlation - all samples

Pearson Correlation - all samples

RMSE - all samples

CI - all samples

F1 - all samples

Average AUC - all samples